Explore different perspectives and approaches to create more truthful and enlightening data visualisation
Anticipating rapid growth, the city of Engagement, Ohio USA is doing a participatory urban planning exercise to understand the current state of the city and identify opportunities for future growth. About 1000 representative residents in this modest-sized city have agreed to provide data using the city’s urban planning app, which records the places they visit, their spending, and their purchases, among other things. From these volunteers, the city will have data to assist with their major community revitalization efforts, including how to allocate a very large city renewal grant they have recently received.
Economic considers the financial health of the city. How does the financial health of the residents change over the period covered by the dataset? How do wages compare to the overall cost of living in Engagement? Are there groups that appear to exhibit similar patterns?
In this exercise, we will explore different perspectives and approaches to create more enlightening data visualisation on dataset VAST Challenge 2022. The operation was carried out on Rstudio and main packages used are tidyverse and ggplot2 extensions.
Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.
The chunk code on the right will do the trick.
packages = c('tidyverse', 'knitr', 'ggdist', 'ggridges',
'scales', 'grid', 'gridExtra','plotly',
'ggrepel', 'formattable', 'patchwork',
'lubridate', 'data.table')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The code chunk below imports participants.csv and FinancialJournal.csv into R environment using read_csv() function of readr package.
participants <- read_csv('data/Participants.csv')
financial <- read_csv('data/FinancialJournal.csv')
It is always a good practice to examine the imported data frame before further analysis is performed.
For example, kable() can be used to review the structure of the imported data frame.
Let’s take an overview of the datasets.
kable(head(participants))
| participantId | householdSize | haveKids | age | educationLevel | interestGroup | joviality |
|---|---|---|---|---|---|---|
| 0 | 3 | TRUE | 36 | HighSchoolOrCollege | H | 0.0016267 |
| 1 | 3 | TRUE | 25 | HighSchoolOrCollege | B | 0.3280865 |
| 2 | 3 | TRUE | 35 | HighSchoolOrCollege | A | 0.3934696 |
| 3 | 3 | TRUE | 21 | HighSchoolOrCollege | I | 0.1380634 |
| 4 | 3 | TRUE | 43 | Bachelors | H | 0.8573967 |
| 5 | 3 | TRUE | 32 | HighSchoolOrCollege | D | 0.7729578 |
kable(head(financial))
| participantId | timestamp | category | amount |
|---|---|---|---|
| 0 | 2022-03-01 | Wage | 2472.50756 |
| 0 | 2022-03-01 | Shelter | -554.98862 |
| 0 | 2022-03-01 | Education | -38.00538 |
| 1 | 2022-03-01 | Wage | 2046.56221 |
| 1 | 2022-03-01 | Shelter | -554.98862 |
| 1 | 2022-03-01 | Education | -38.00538 |
In order to understand the financial health of the residents change over the period, we need to derive income, overall cost and balance of residents in a monthly basis.
Monthly income/cost for residents need to be derived to view the change over recorded 15 months. Code chunk below shows how we change time format to monthly basis.
# A tibble: 1,856,330 x 4
participantId category amount yearmonth
<dbl> <chr> <dbl> <chr>
1 0 Wage 2473. 2022.03
2 0 Shelter -555. 2022.03
3 0 Education -38.0 2022.03
4 1 Wage 2047. 2022.03
5 1 Shelter -555. 2022.03
6 1 Education -38.0 2022.03
7 2 Wage 2437. 2022.03
8 2 Shelter -557. 2022.03
9 2 Education -12.8 2022.03
10 3 Wage 2367. 2022.03
# ... with 1,856,320 more rows
Firstly, we need to use group_by to group individual expense and income category. And then use summarise function to summarize each category.
summarizedFinancial <- monthlyFinancial %>%
group_by(participantId, category, yearmonth) %>%
summarise(monthly_financial = sum(amount))
summarizedFinancial
# A tibble: 55,498 x 4
# Groups: participantId, category [4,142]
participantId category yearmonth monthly_financial
<dbl> <chr> <chr> <dbl>
1 0 Education 2022.03 -76.0
2 0 Education 2022.04 -38.0
3 0 Education 2022.05 -38.0
4 0 Education 2022.06 -38.0
5 0 Education 2022.07 -38.0
6 0 Education 2022.08 -38.0
7 0 Education 2022.09 -38.0
8 0 Education 2022.10 -38.0
9 0 Education 2022.11 -38.0
10 0 Education 2022.12 -38.0
# ... with 55,488 more rows
Then, the dataframe need to be pivoted using code chunk below.
Financial <- summarizedFinancial %>%
pivot_wider(names_from = category, values_from = monthly_financial)
Financial[is.na(Financial)] = 0
Financial
# A tibble: 13,331 x 8
# Groups: participantId [1,011]
participantId yearmonth Education Food Recreation Shelter Wage
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 2022.03 -76.0 -268. -349. -1110. 11932.
2 0 2022.04 -38.0 -266. -219. -555. 8637.
3 0 2022.05 -38.0 -265. -383. -555. 9048.
4 0 2022.06 -38.0 -257. -466. -555. 9048.
5 0 2022.07 -38.0 -270. -1070. -555. 8637.
6 0 2022.08 -38.0 -262. -314. -555. 9459.
7 0 2022.09 -38.0 -256. -295. -555. 9048.
8 0 2022.10 -38.0 -267. -25.0 -555. 8637.
9 0 2022.11 -38.0 -261. -377. -555. 9048.
10 0 2022.12 -38.0 -266. -357. -555. 9048.
# ... with 13,321 more rows, and 1 more variable:
# RentAdjustment <dbl>
To show the change of financial situation of residents during this 15 months,we need to calculate the monthly income , monthly living cost as well as monthly balance and then used them to visualize in the next part.
Residents’ monthly income is derived by calculating the sum of wage.
Cost of living is made up of expenses from education, food, recreation, shelter, and offset the rent adjustment. Residents’ monthly cost of living is derived by calculating the sum of above expenses.
FINANCIAL <- Financial %>%
mutate(monthly_cost = Education + Food + Recreation
+ Shelter + RentAdjustment) %>%
mutate(monthly_income = Wage) %>%
mutate(monthly_balance = monthly_income + monthly_cost)
FINANCIAL
# A tibble: 13,331 x 11
# Groups: participantId [1,011]
participantId yearmonth Education Food Recreation Shelter Wage
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 2022.03 -76.0 -268. -349. -1110. 11932.
2 0 2022.04 -38.0 -266. -219. -555. 8637.
3 0 2022.05 -38.0 -265. -383. -555. 9048.
4 0 2022.06 -38.0 -257. -466. -555. 9048.
5 0 2022.07 -38.0 -270. -1070. -555. 8637.
6 0 2022.08 -38.0 -262. -314. -555. 9459.
7 0 2022.09 -38.0 -256. -295. -555. 9048.
8 0 2022.10 -38.0 -267. -25.0 -555. 8637.
9 0 2022.11 -38.0 -261. -377. -555. 9048.
10 0 2022.12 -38.0 -266. -357. -555. 9048.
# ... with 13,321 more rows, and 4 more variables:
# RentAdjustment <dbl>, monthly_cost <dbl>, monthly_income <dbl>,
# monthly_balance <dbl>
In order to show the income and consumption patterns in different groups, we will combine FINANCIAL and participants dataframe together.
combine <- FINANCIAL %>%
left_join(participants, by = "participantId")
combine
# A tibble: 13,331 x 17
# Groups: participantId [1,011]
participantId yearmonth Education Food Recreation Shelter Wage
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 2022.03 -76.0 -268. -349. -1110. 11932.
2 0 2022.04 -38.0 -266. -219. -555. 8637.
3 0 2022.05 -38.0 -265. -383. -555. 9048.
4 0 2022.06 -38.0 -257. -466. -555. 9048.
5 0 2022.07 -38.0 -270. -1070. -555. 8637.
6 0 2022.08 -38.0 -262. -314. -555. 9459.
7 0 2022.09 -38.0 -256. -295. -555. 9048.
8 0 2022.10 -38.0 -267. -25.0 -555. 8637.
9 0 2022.11 -38.0 -261. -377. -555. 9048.
10 0 2022.12 -38.0 -266. -357. -555. 9048.
# ... with 13,321 more rows, and 10 more variables:
# RentAdjustment <dbl>, monthly_cost <dbl>, monthly_income <dbl>,
# monthly_balance <dbl>, householdSize <dbl>, haveKids <lgl>,
# age <dbl>, educationLevel <chr>, interestGroup <chr>,
# joviality <dbl>
To visualise the financial change during this 15 months, we will use ridge plot to show the economic situation
p1 <- ggplot(combine,
aes(x=Wage,
y=combine$yearmonth,
fill = factor(stat(quantile)))) +
stat_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = TRUE,
quantiles = 4,
quantile_lines = TRUE) +
scale_fill_viridis_d(name = "Quartiles") +
labs(x= "Wage",
y= "Time",
title="Distribution of Residents' Wage")
p1
According to above ridge plot, we can know that residents’ wage in 2022 March is higher than the following 14 months.
p2 <- ggplot(combine,
aes(x = combine$monthly_balance,
y = combine$yearmonth))+
geom_density_ridges(jittered_points = TRUE,
position = position_points_jitter(width = 0.05,
height = 0),
point_shape = '|',
point_size = 3,
point_alpha = 1,
alpha = 0.7,) +
stat_density_ridges(geom = "density_ridges_gradient", calc_ecdf = TRUE) +
geom_density_ridges_gradient(scale = 2, rel_min_height = 0.01) +
scale_fill_viridis_c(name = "Monthly Balance", direction = -1) +
theme(axis.title.y=element_text(angle=0),
axis.line = element_line(color='grey'),
plot.title = element_text(hjust = 0.5),
axis.title.y.left = element_text(vjust = 0.5,),
axis.text = element_text(face="bold")) +
labs(x= "Monthly Balance",
y= "Time",
title="Distribution of Residents' Monthly Balance")
p2
Accordingly, the distribution of monthly balance is in a similar pattern. And the dot plot delow the ridge plot shows that people with higher balance is affected even more serious.
Residents
edu_financial <- combine %>%
select(educationLevel, yearmonth, monthly_income,
monthly_cost, monthly_balance) %>%
group_by(educationLevel) %>%
summarise(avgIncome = mean(monthly_income),
medianIncome =median(monthly_income),
avgCost = mean(monthly_cost),
avgBalance = mean(monthly_balance))
edu_financial
# A tibble: 4 x 5
educationLevel avgIncome medianIncome avgCost avgBalance
<chr> <dbl> <dbl> <dbl> <dbl>
1 Bachelors 4989. 4296. -1423. 3566.
2 Graduate 6154. 5394. -1478. 4677.
3 HighSchoolOrCollege 3151. 2785. -1374. 1777.
4 Low 2661. 2373. -1260. 1401.
# A tibble: 150 x 5
# Groups: interestGroup [10]
interestGroup yearmonth avgIncome avgCost avgBalance
<chr> <chr> <dbl> <dbl> <dbl>
1 A 2022.03 6363. -2091. 4272.
2 A 2022.04 3961. -1441. 2520.
3 A 2022.05 4140. -1377. 2762.
4 A 2022.06 4132. -1354. 2778.
5 A 2022.07 3975. -1386. 2588.
6 A 2022.08 4308. -1340. 2967.
7 A 2022.09 4134. -1354. 2780.
8 A 2022.10 3969. -1375. 2594.
9 A 2022.11 4129. -1314. 2815.
10 A 2022.12 4147. -1363. 2784.
# ... with 140 more rows